INTERSPEECH.2008 - Speech Synthesis

Total: 54

#1 Articulatory control of HMM-based parametric speech synthesis driven by phonetic knowledge

Authors: Zhen-Hua Ling ; Korin Richmond ; Junichi Yamagishi ; Ren-Hua Wang

This paper presents a method to control the characteristics of synthetic speech flexibly by integrating articulatory features into a Hidden Markov Model (HMM)-based parametric speech synthesis system. In contrast to model adaptation and interpolation approaches to speaking style control, this method is driven by phonetic knowledge, and target speech samples are not required. The joint distribution of parallel acoustic and articulatory features is estimated, taking cross-stream feature dependency into account. At synthesis time, acoustic and articulatory features are generated simultaneously based on the maximum-likelihood criterion. The synthetic speech can be controlled flexibly by modifying the generated articulatory features according to arbitrary phonetic rules during parameter generation. Our experiments show that the proposed method is effective both in changing the overall character of synthesized speech and in controlling the quality of a specific vowel.
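The core mechanism lends itself to a short illustration. The sketch below is not the authors' implementation: it assumes, as a simplification, that a single acoustic and a single articulatory feature vector are jointly Gaussian (the paper models full feature streams with HMM state distributions and maximum-likelihood generation), and shows how a rule-modified articulatory vector can steer the generated acoustics through the conditional mean. All names and values are hypothetical.

```python
import numpy as np

def condition_acoustic_on_articulatory(mu, sigma, n_ac, x_art):
    """E[acoustic | articulatory = x_art] for a joint Gaussian over the
    stacked vector [acoustic; articulatory]; the cross-covariance block
    is what lets a phonetic-rule edit of x_art propagate into the
    acoustic stream."""
    mu_ac, mu_art = mu[:n_ac], mu[n_ac:]
    s_ax = sigma[:n_ac, n_ac:]     # acoustic-articulatory cross block
    s_xx = sigma[n_ac:, n_ac:]     # articulatory-articulatory block
    return mu_ac + s_ax @ np.linalg.solve(s_xx, x_art - mu_art)

# toy example: 2-d acoustic block, 1-d articulatory block
rng = np.random.default_rng(0)
a = rng.normal(size=(3, 3))
sigma = a @ a.T + 3.0 * np.eye(3)  # a valid joint covariance
mu = np.zeros(3)
print(condition_acoustic_on_articulatory(mu, sigma, 2, np.array([1.5])))
```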

#2 Minimum generation error training with direct log spectral distortion on LSPs for HMM-based speech synthesis

Authors: Yi-Jian Wu ; Keiichi Tokuda

A minimum generation error (MGE) criterion has been proposed to solve the issues related to maximum-likelihood (ML) based HMM training in HMM-based speech synthesis. In this paper, we improve the MGE criterion by using a log spectral distortion (LSD) instead of the Euclidean distance to define the generation error between the original and generated line spectral pair (LSP) coefficients. Moreover, we investigate the effect of different sampling strategies for calculating the integral of the LSD function. The experimental results show that using LSDs calculated by sampling at the LSPs achieved the best performance, and that the quality of the synthesized speech after MGE-LSD training was improved over the original MGE training.
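As a rough illustration of the sampling idea only (not of MGE training itself), the sketch below computes an RMS log spectral distortion between two all-pole envelopes by evaluating them at a handful of frequencies standing in for a frame's LSP positions, rather than on a dense uniform grid. The coefficients and frequencies are made up.

```python
import numpy as np
from scipy.signal import freqz

def lsd_at_freqs(a_ref, a_gen, freqs):
    """RMS log spectral distortion (dB) between two all-pole filters
    1/A(z), approximated by sampling only at the given frequencies
    (rad/sample) instead of integrating over the full spectrum."""
    _, h_ref = freqz(1.0, a_ref, worN=freqs)
    _, h_gen = freqz(1.0, a_gen, worN=freqs)
    diff_db = 20.0 * np.log10(np.abs(h_ref) / np.abs(h_gen))
    return np.sqrt(np.mean(diff_db ** 2))

# toy example: two slightly different 2nd-order all-pole envelopes,
# sampled at stand-in "LSP" frequencies of the reference frame
a_ref = np.array([1.0, -1.2, 0.6])
a_gen = np.array([1.0, -1.1, 0.55])
lsp_freqs = np.array([0.8, 1.1, 1.9, 2.3])
print(lsd_at_freqs(a_ref, a_gen, lsp_freqs))
```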

#3 Robustness of HMM-based speech synthesis

Authors: Junichi Yamagishi ; Zhen-Hua Ling ; Simon King

As speech synthesis techniques become more advanced, we are able to consider building high-quality voices from data collected outside the usual highly controlled recording studio environment. This presents new challenges that are not present in conventional text-to-speech synthesis: the available speech data are not perfectly clean, the recording conditions are not consistent, and/or the phonetic balance of the material is not ideal. Although a clear picture of the performance of various speech synthesis techniques (e.g., concatenative, HMM-based or hybrid) under good conditions is provided by the Blizzard Challenge, it is not well understood how robust these algorithms are to less favourable conditions. In this paper, we analyse the performance of several speech synthesis methods under such conditions. This is, as far as we know, a new research topic: "robust speech synthesis". As a consequence of our investigations, we propose a new robust training method for HMM-based speech synthesis, for use with speech data collected in unfavourable conditions.

#4 Improving preselection in unit selection synthesis

Authors: Alistair Conkie ; Ann Syrdal ; Yeon-Jun Kim ; Mark Beutnagel

Unit selection synthesis is a method of selecting and concatenating speech segments from a large single-speaker audio database to synthesize utterances. Selection is based on assigning target and concatenation costs to units and then finding a lowest cost sequence of units that will synthesize a given utterance. In order to synthesize efficiently, it is necessary to limit the number of units considered in the unit selection cost network, a part of the process called preselection. This paper examines the role of preselection in unit selection synthesis. We refine the existing process of preselection by adding multiple phone sets to the list of features considered. We present experimental results that demonstrate better database usage and significantly increased synthesis quality using this new method.
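A minimal sketch of the preselection idea, under stated assumptions: candidates are ranked by how many of several phone sets agree on their immediate context, and only the top few enter the cost network. The phone sets, features and values here are illustrative, not the ones used in the paper.

```python
def preselect(candidates, target, phone_sets, max_candidates=50):
    """Rank candidate units by how many phone sets (full phones, broad
    classes, ...) match their left/right context to the target's, and
    keep only the best few for the unit-selection cost network."""
    def score(unit):
        return sum(ps(unit["left"]) == ps(target["left"]) and
                   ps(unit["right"]) == ps(target["right"])
                   for ps in phone_sets)
    return sorted(candidates, key=score, reverse=True)[:max_candidates]

# toy phone sets: identity (full phones) and a broad-class mapping
broad = {"p": "stop", "b": "stop", "s": "fric", "z": "fric", "a": "vow"}
phone_sets = [lambda p: p, lambda p: broad.get(p, p)]

candidates = [{"left": "p", "right": "a"}, {"left": "b", "right": "a"},
              {"left": "s", "right": "a"}]
target = {"left": "p", "right": "a"}
print(preselect(candidates, target, phone_sets, max_candidates=2))
```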

#5 Efficient join cost computation for unit selection based TTS systems

Authors: Feng Ding ; Jani Nurminen ; Jilei Tian

A new, efficient join cost calculation technique for unit selection based synthesis is proposed. The acoustic features representing the spectral content at the unit boundaries are encoded using multi-stage vector quantization. After pseudo-Gray coding is applied, the join costs are approximated directly from the stage-wise codebook indices. As a result, both the memory requirement and the computational complexity are reduced at the same time, making the technique especially suitable for embedded text-to-speech systems. Experiments are carried out comparing the proposed scheme with the original baseline technique, which operates in a lossless manner using the uncompressed acoustic data and similarity measurement. Based on the experimental findings, the proposed technique appears to fully maintain speech quality despite the considerable reduction in complexity and memory usage.
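The index-based approximation can be sketched as follows, with pseudo-Gray coding and the exact distance measure omitted: precompute one codeword-to-codeword distance table per VQ stage, then approximate a join cost as the sum of stage-wise table lookups. Shapes and values are hypothetical.

```python
import numpy as np

def build_stage_tables(codebooks):
    """Precompute, for each VQ stage, the squared distances between all
    pairs of codewords; join costs can then be approximated from the
    stage-wise indices alone, with no decoding of the features."""
    tables = []
    for cb in codebooks:  # cb: (n_codewords, dim)
        diff = cb[:, None, :] - cb[None, :, :]
        tables.append(np.sum(diff ** 2, axis=-1))
    return tables

def approx_join_cost(tables, idx_left, idx_right):
    # sum the per-stage lookups for the two units' boundary frames
    return sum(t[i, j] for t, i, j in zip(tables, idx_left, idx_right))

# toy example: a two-stage VQ of boundary spectra
rng = np.random.default_rng(1)
codebooks = [rng.normal(size=(8, 4)), rng.normal(size=(8, 4))]
tables = build_stage_tables(codebooks)
print(approx_join_cost(tables, idx_left=(3, 5), idx_right=(0, 7)))
```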

#6 A phonetic assessment of cross-language voice conversion

Authors: Kayoko Yanagisawa ; Mark Huckvale

Cross-language voice conversion maps the speech of speaker S1 in language L1 to the voice of speaker S2 using knowledge only of how S2 speaks a different language L2. This mapping is usually performed using speech material from S1 and S2 that has been deemed "equivalent" in either acoustic or phonetic terms. This study investigates the issue of equivalence in more detail, and contrasts the performance of a voice conversion system operating in both mono-lingual and cross-lingual modes using Japanese and English. We show that voice conversion impacts the intelligibility of the converted speech, but to a significantly greater degree for cross-language conversion. A phonetic comparison of the monolingual and cross-language converted speech suggests that consonantal information is degraded in both conditions, but vowel information is degraded more in the cross-language condition.

#7 Synthesis by generation and concatenation of multiform segments

Authors: Vincent Pollet ; Andrew Breen

Machine-generated speech can be produced in different ways, but two basic methods for synthesizing speech are in widespread use. One method generates speech from models, while the other concatenates pre-stored speech segments. This paper presents a speech synthesis technique in which these two basic synthesis methods are combined in a statistical framework. Synthetic speech is constructed by generation and concatenation of so-called "multiform segments". Multiform segments are different speech signal representations: synthesis models, templates, and synthesis models augmented with template information. An evaluation of the multiform segment synthesis technique shows improvements over traditional concatenative methods of synthesis.

#8 Glottal spectral separation for parametric speech synthesis

Authors: João P. Cabral ; Steve Renals ; Korin Richmond ; Junichi Yamagishi

The great advantage of using a glottal source model in parametric speech synthesis is the degree of parametric flexibility it gives to transform and model aspects of voice quality and speaker identity. However, few studies have addressed how the glottal source affects the quality of synthetic speech.

#9 Improving speech systems built from very little data

Authors: John Kominek ; Sameer Badaskar ; Tanja Schultz ; Alan W. Black

This paper studies two ways of helping non-specialist users develop speech systems for new languages from limited data. Focused web re-crawling finds additional examples of text matching the domain specified by the user; this improves the language model and cuts the word error rate nearly in half. Iterative voice building with interleaved lexicon construction uses the voice from a previous iteration to help construct an improved voice; 4.5 hours of the user's time reduces the transcription error rate from 32% to 4%.

#10 Structure to speech conversion - speech generation based on infant-like vocal imitation

Authors: Daisuke Saito ; Satoshi Asakawa ; Nobuaki Minematsu ; Keikichi Hirose

This paper proposes a new framework for speech generation that imitates infants' vocal imitation. Most speech synthesizers take a phoneme sequence as input and generate speech by converting each of the phonemes into a sound sequentially. In other words, they simulate the human process of reading text aloud. However, infants usually acquire the ability to generate speech without text or phoneme sequences. Since their phonemic awareness is very immature, they can hardly decompose a word utterance into a sequence of phones. Instead, as developmental psychology states, infants acquire the holistic sound pattern of a word, called its word Gestalt, from the utterances of their parents, and they reproduce it with their vocal tubes. This behavior is called vocal imitation. In our previous studies, the word Gestalt was defined physically, and a method of extracting it from an utterance was proposed and used successfully for ASR and CALL. In this paper, a method of converting the word Gestalt back to speech is proposed and evaluated. Unlike a reading machine, our proposal simulates infants' vocal imitation.

#11 Statistical text-to-speech synthesis with improved dynamics

Authors: Stas Tiomkin ; David Malah

In statistical TTS systems (STTS), speech feature dynamics is modeled by first- and second-order feature frame differences, which typically do not satisfactorily represent the frame-to-frame feature dynamics present in natural speech. The reduced dynamics results in over-smoothing of speech features, which often sounds like muffled synthesized speech. To improve feature dynamics, a Global Variance approach has been suggested; however, it is computationally complex. We propose a different approach for modeling feature dynamics, based on applying the DFT to the whole set of feature frames representing a phoneme. In the transform domain, the inter-frame feature dynamics is expressed in terms of inter-harmonic content, which can be modified to statistically match the dynamics of natural speech. To synthesize a whole utterance, we propose a method for smoothly combining the enhanced-dynamics phonemes, which improves the synthesized speech quality of STTS at a complexity similar to that of conventional STTS.
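A toy rendering of the transform-domain idea, with a fixed gain standing in for the statistical matching to natural speech described above: take the DFT of a phoneme's feature trajectory along the time axis, amplify the non-DC temporal harmonics, and invert.

```python
import numpy as np

def enhance_dynamics(frames, gain=1.5, keep_dc=1):
    """Sharpen frame-to-frame dynamics of one phoneme's feature
    trajectory by scaling its temporal DFT harmonics.

    frames: (n_frames, n_dims) feature matrix for one phoneme.
    gain:   boost applied to all non-DC temporal harmonics (a stand-in
            for matching the harmonic magnitudes of natural speech).
    """
    spec = np.fft.rfft(frames, axis=0)  # DFT over the time axis
    spec[keep_dc:] *= gain              # leave the mean (DC) untouched
    return np.fft.irfft(spec, n=frames.shape[0], axis=0)

# toy example: an over-smoothed 10-frame, 3-dim trajectory
t = np.linspace(0, 1, 10)[:, None]
frames = 0.2 * np.sin(2 * np.pi * t) + np.array([1.0, 0.5, -0.3])
print(enhance_dynamics(frames).round(3))
```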

#12 An evaluation of non-standard features for grapheme-to-phoneme conversion

Authors: Gabriel Webster ; Norbert Braunschweiler

Machine learning methods for grapheme-to-phoneme (G2P) conversion are popular, but the features used in the literature are most often simply a window of context letters, despite the availability of other features. In this paper, a set of features beyond the seven-letter window, termed non-standard features, is systematically evaluated for American English using decision trees. The results show that adding non-standard features to a seven-letter window gives clear improvements for English, with the most important features being the three previously predicted phones, an initial prediction of lexical stress location, and a window of vowel letters around the current letter.
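A minimal decision-tree G2P sketch under assumptions (scikit-learn, toy data, a crude stress proxy): the feature dictionary combines the seven-letter window with two "non-standard" features of the kind evaluated here, the previously predicted phones and an initial stress-related guess. Data and helper names are hypothetical.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

def features(word, i, prev_phones):
    """Seven-letter window around letter i, plus two 'non-standard'
    features: the phones already predicted and a crude stress proxy
    (position of the first vowel letter)."""
    w = "###" + word + "###"
    f = {f"l{k}": w[i + 3 + k] for k in range(-3, 4)}  # letter window
    f["prev_phones"] = "+".join(prev_phones[-3:])      # phone history
    f["first_vowel"] = str(min((word.find(v) for v in "aeiou"
                                if v in word), default=-1))
    return f

# toy training data: (word, letter index, phone history) -> phone
data = [("cat", 1, ["K"]), ("car", 1, ["K"]), ("city", 1, ["S"])]
labels = ["AE", "AA", "IH"]
model = make_pipeline(DictVectorizer(), DecisionTreeClassifier())
model.fit([features(w, i, h) for w, i, h in data], labels)
print(model.predict([features("cab", 1, ["K"])]))
```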

#13 Towards flexible speech coding for speech synthesis: an LF + modulated noise vocoder

Authors: Yannis Agiomyrgiannakis ; Olivier Rosec

This paper presents an ARX-LF-based model of speech that is amenable to low-bit-rate quantization and to speech modifications directly in the parametric domain. The new model successfully addresses the non-deterministic part of voiced speech by modulating noise with the glottal flow, while unvoiced speech and transients are synthesized by modulating noise with a signal-derived time envelope. The presented work is essentially a high-quality vocoder that can be used for low-complexity coding, synthesis and modification of speech, suitable for embedded text-to-speech applications.
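The noise-modulation component can be illustrated in a few lines. This is a stand-in sketch, not the ARX-LF vocoder itself: white noise is shaped by the time envelope of a reference signal (the glottal flow for the non-deterministic part of voiced speech, a signal-derived envelope for unvoiced speech and transients).

```python
import numpy as np
from scipy.signal import hilbert

def modulated_noise(reference, rng):
    """Shape white noise with the time envelope of a reference signal,
    approximating the 'noise modulated by the glottal flow / by a
    signal-derived envelope' idea."""
    envelope = np.abs(hilbert(reference))  # time-domain envelope
    noise = rng.normal(size=reference.size)
    return envelope * noise

# toy example: a decaying 'glottal pulse' shaping one period of noise
t = np.linspace(0, 1, 160)
pulse = np.sin(np.pi * t) ** 2 * np.exp(-3 * t)
print(modulated_noise(pulse, np.random.default_rng(4))[:5])
```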

#14 Evaluation of Finnish unit selection and HMM-based speech synthesis

Authors: Hanna Silen ; Elina Helander ; Jani Nurminen ; Moncef Gabbouj

Unit selection and hidden Markov model (HMM) based synthesis have become the dominant techniques in text-to-speech (TTS) research. In this work, we combine HMM-based signal generation with the front end originally designed for unit selection based Finnish TTS, and we evaluate the prosody of the output generated by the two synthesis techniques using the same speech database. Furthermore, we study the effect that the training set size has on prosody and intelligibility in HMM-based synthesis. The results indicate that the HMM-based approach is capable of providing better prosody than unit selection even when the training set size is severely limited. The size of the training set does, however, affect the prosodic quality and intelligibility of the HMM-based synthesizer.

#15 A probabilistic trajectory synthesis system for synthesising visual speech

Authors: Barry-John Theobald ; Nicholas Wilkinson

We describe an unsupervised probabilistic approach for synthesising visual speech from audio. Acoustic features representing a training corpus are clustered, and the probability density function (PDF) of each cluster is modelled as a Gaussian mixture model (GMM). A visual target in the form of a short-term parameter trajectory is generated for each cluster. Synthesis involves combining the cluster targets based on the likelihood of novel acoustic feature vectors, then cross-blending neighbouring regions of the synthesised short-term trajectories. The advantage of our approach is that coarticulation effects are explicitly captured by the mapping: the influence of each cluster target naturally increases and decreases with the likelihood of the acoustic feature vectors.
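A simplified sketch of the blending step, collapsing the paper's per-cluster GMMs into a single scikit-learn mixture whose components play the role of clusters: each incoming acoustic frame weights the clusters' visual target trajectories by its posterior probability, so coarticulation emerges from overlapping posteriors. The data and trajectory shapes are hypothetical.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# toy data: acoustic feature vectors and per-cluster visual targets
rng = np.random.default_rng(2)
acoustic_train = rng.normal(size=(200, 6))
gmm = GaussianMixture(n_components=4, random_state=0).fit(acoustic_train)

# hypothetical visual target trajectory per cluster: (T, n_visual)
cluster_targets = rng.normal(size=(4, 5, 2))

def synthesize_visual(acoustic_frame):
    """Blend the clusters' visual target trajectories, weighted by the
    posterior probability of the incoming acoustic frame under each
    cluster."""
    w = gmm.predict_proba(acoustic_frame[None, :])[0]  # (4,)
    return np.tensordot(w, cluster_targets, axes=1)    # (5, 2)

print(synthesize_visual(rng.normal(size=6)).shape)
```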

#16 Paralinguistic elements in speech synthesis

Authors: Didier Cadic ; Lionel Segalen

Corpus-based text-to-speech systems currently produce very natural synthetic sentences, though they are limited to a neutral, inexpressive speaking style. Paralinguistic elements are among the expressive features one would most like to introduce. In this paper, we describe a new method for introducing laughter and hesitation into synthetic speech. Thanks to a small dedicated acoustic database, this method can successfully render transitions between speech and paralinguistic elements. We validate it here for French, but extension to other languages should be straightforward.

#17 Building sleek synthesizers for multi-lingual screen reader

Authors: E Veera Raghavendra ; B. Yegnanarayana ; Alan W. Black ; Kishore Prahallad

In this paper, we investigate which unit size (syllable, half-phone or quarter-phone) to use for speech synthesis in a multi-lingual screen reader, for phonetic languages such as Telugu and for the non-phonetic language English. Perceptual studies show that syllable-level units perform better for Telugu and half-phone units perform better for English. While syllable-based synthesizers produce better-sounding speech, covering all syllables is a non-trivial issue. We address the issue of syllable coverage through approximate matching of syllables, and show that such approximation produces more intelligible and better-quality speech than diphone units. We also propose a hybrid synthesizer within the unit selection framework, and show that a hybrid synthesizer built from a pruned database performs as well as one built from the unpruned database.
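The approximate matching could take the form of the back-off sketch below, with generic string similarity standing in for whatever matching criterion the authors actually use; the inventory is hypothetical.

```python
import difflib

def approximate_syllable(syllable, inventory):
    """Back off to the closest syllable available in the database when
    the exact syllable is missing (string similarity is only a stand-in
    for the paper's matching criterion)."""
    best = difflib.get_close_matches(syllable, inventory, n=1, cutoff=0.0)
    return best[0] if best else None

inventory = ["ka", "ki", "ku", "kam", "ram"]
print(approximate_syllable("kim", inventory))  # -> 'ki'
```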

#18 Unsupervised adaptation for HMM-based speech synthesis

Authors: Simon King ; Keiichi Tokuda ; Heiga Zen ; Junichi Yamagishi

It is now possible to synthesise speech using HMMs with quality comparable to unit-selection techniques. Generating speech from a model has many potential advantages over concatenating waveforms. The most exciting is model adaptation: it has been shown that supervised speaker adaptation can yield high-quality synthetic voices with an order of magnitude less data than is required to train a speaker-dependent model or to build a basic unit-selection system. Such supervised methods require labelled adaptation data for the target speaker. In this paper, we introduce a method capable of unsupervised adaptation, using only speech from the target speaker without any labelling.

#19 Investigating Festival's target cost function using perceptual experiments

Authors: Volker Strom ; Simon King

We describe an investigation of the target cost used in the Festival unit selection speech synthesis system [1]. Our ultimate goal is to automatically learn a perceptually optimal target cost function. In this study, we investigated the behaviour of the target cost for one segment type. The target cost is based on counting the mismatches in several context features. A carrier sentence ("My name is Roger") was synthesised using all 147,820 possible combinations of the diphones /n_ei/ and /ei_m/. 92 representative versions were selected and presented to listeners as 460 pairwise comparisons. The listeners' preference votes were used to analyse the behaviour of the target cost, with respect to the values of its component linguistic context features.
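For concreteness, a target cost of the kind investigated here, a weighted count of mismatching context features, can be written as below. The feature names and weights are illustrative, not Festival's actual values.

```python
def target_cost(candidate, target, weights):
    """Target cost as a weighted count of mismatches between the
    candidate unit's linguistic context features and the target's."""
    return sum(w for feat, w in weights.items()
               if candidate.get(feat) != target.get(feat))

weights = {"stress": 2.0, "phrase_final": 1.5, "left_phone": 1.0,
           "right_phone": 1.0, "pos": 0.5}
target = {"stress": 1, "phrase_final": 0, "left_phone": "n",
          "right_phone": "m", "pos": "noun"}
candidate = {"stress": 1, "phrase_final": 1, "left_phone": "n",
             "right_phone": "b", "pos": "noun"}
print(target_cost(candidate, target, weights))  # 1.5 + 1.0 = 2.5
```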

#20 Modeling Austrian dialect varieties for TTS

Authors: Friedrich Neubarth ; Michael Pucher ; Christian Kranzler

In this paper we discuss certain strategies for building adapted TTS systems for dialectal or regional varieties from a given standard source. The basic question is how much re-coding is necessary for a given transfer and to what extent it is possible to rely on the speech data alone. It will turn out that there are ambiguities that cannot be resolved without a certain amount of linguistic engineering. For exemplification we present two experiments dealing with Austrian standard German and Viennese dialect on the level of phonetic lexicon and orthography.

#21 HMM-based Finnish text-to-speech system utilizing glottal inverse filtering

Authors: Tuomo Raitio ; Antti Suni ; Hannu Pulakka ; Martti Vainio ; Paavo Alku

This paper describes an HMM-based speech synthesis system that utilizes glottal inverse filtering to generate natural-sounding synthetic speech. In the proposed system, speech is first parametrized into spectral and excitation features using a glottal inverse filtering based method. The parameters are fed into an HMM system for training and are then generated from the trained HMMs according to text input. Glottal flow pulses extracted from real speech are used as the voice source, and the voice source is further modified according to the all-pole model parameters generated by the HMM. Preliminary experiments show that the proposed system is capable of generating natural-sounding speech, with clearly better quality than a system utilizing a conventional impulse train excitation model.

#22 LTS using decision forest of regression trees and neural networks

Authors: Tanuja Sarkar ; Sachin Joshi ; Sathish Chandra Pammi ; Kishore Prahallad

Letter-to-sound (LTS) rules play a vital role in building a speech synthesis system. In this paper, we apply various machine learning approaches, namely Classification and Regression Trees (CART), decision forests, forests of Artificial Neural Networks (ANNs) and Auto-Associative Neural Networks (AANNs), to learning LTS rules. We use these techniques mainly for schwa deletion in Hindi. We show empirically that LTS using decision forests and forests of ANNs outperforms the previous CART and single-ANN approaches respectively, and that the non-discriminative AANN technique could not capture the LTS rules as efficiently as the discriminative techniques. We explore the use of syllabic features, namely syllable structure, onset of the syllable, number of syllables and position of the schwa, along with the primary contextual features. The results show that the use of these features leads to good performance: the decision forest and forest-of-ANNs approaches yielded phone accuracies of 92.86% and 93.18% respectively using the newly incorporated features for Hindi LTS.
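The forest's decision rule itself is simple majority voting, sketched below with hand-written stand-in "trees" in place of trained classifiers; the context features are hypothetical.

```python
from collections import Counter

def forest_predict(trees, features):
    """Decision-forest LTS prediction: each tree votes for an outcome
    (e.g. whether a schwa is deleted), and the majority wins."""
    votes = Counter(tree(features) for tree in trees)
    return votes.most_common(1)[0][0]

# toy 'trees': stand-ins for trained classifiers deciding whether a
# Hindi schwa is deleted in a given context
trees = [
    lambda f: "delete" if f["word_final"] else "keep",
    lambda f: "delete" if f["n_syllables"] > 2 else "keep",
    lambda f: "keep",
]
print(forest_predict(trees, {"word_final": True, "n_syllables": 2}))
# -> 'keep' (two of the three trees vote against deletion)
```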

#23 Automatic word stress marking and syllabification for Catalan TTS

Authors: Silvia Rustullet ; Daniela Braga ; João Nogueira ; Miguel Sales Dias

Stress and syllabification are essential attributes for several components in text-to-speech (TTS) systems. They are responsible for improving grapheme-to-phoneme conversion rules and for enhancing synthetic intelligibility, since stress and the syllable are key units in prosody prediction. This paper presents three linguistically rule-based automatic algorithms for Catalan text-to-speech conversion: a word stress marker, an orthographic syllabification algorithm and a phonological syllabification algorithm. The systems were implemented and tested, yielding the following word accuracy rates: 100% for the stress marker, 99.7% for the orthographic syllabification algorithm and 99.8% for the phonological syllabification algorithm.

#24 Analysis of voice-quality features of speech that expresses 'anger', 'joy', and 'sadness' uttered by radio actors and actresses

Authors: Shoichi Takeda ; Yuuri Yasuda ; Risako Isobe ; Shogo Kiryu ; Makiko Tsuru

This paper describes an analysis of the voice-quality features of "anger", "joy", and "sadness" in Japanese speech, depending on the degree of the expressed emotion. The degrees of emotion were "neutral", "light", "medium" and "strong". Among voice-quality features, we focused on the noise level of the glottal-flow waveform. We adopted an AR model and measured the noise levels of the predictive residual signal of speech expressing each emotion. To measure the noise level relative to the signal level, a "noise-to-signal (N/S) ratio" was introduced. The analysis results showed that the relative noise levels in the residual-waveform spectra differed between emotions: the N/S ratio decreased in the order "anger" > "sadness" > "neutral" > "joy", in steps of approximately 4 dB.
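A crude, self-contained approximation of the measurement, under stated assumptions: the AR model is estimated with the autocorrelation (Yule-Walker) method, and total residual energy stands in for the paper's noise levels measured on residual-waveform spectra. The toy frame is synthetic.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def lpc(frame, order):
    """AR coefficients A(z) = 1 + a1 z^-1 + ... via Yule-Walker."""
    r = np.correlate(frame, frame, mode="full")[frame.size - 1:
                                                frame.size + order]
    a = solve_toeplitz((r[:-1], r[:-1]), -r[1:])
    return np.concatenate(([1.0], a))

def ns_ratio_db(frame, order=12):
    """Energy of the AR prediction residual relative to the frame
    energy, in dB: a simplified stand-in for the paper's N/S ratio."""
    residual = lfilter(lpc(frame, order), [1.0], frame)
    return 10.0 * np.log10(np.sum(residual ** 2) / np.sum(frame ** 2))

# toy example: a vowel-like frame = harmonic part + additive noise
sr, f0 = 16000, 120
t = np.arange(0, 0.03, 1 / sr)
rng = np.random.default_rng(3)
frame = np.sin(2 * np.pi * f0 * t) + 0.05 * rng.normal(size=t.size)
print(ns_ratio_db(frame))
```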

#25 Including pitch accent optionality in unit selection text-to-speech synthesis

Authors: Leonardo Badino ; Robert A. J. Clark ; Volker Strom

Significant variability in pitch accent placement is found when comparing the patterns of prosodic prominence realized by different English speakers reading the same sentences. In this paper we describe a simple approach to incorporating this variability when synthesizing prosodic prominence in unit selection text-to-speech synthesis.